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Abstract 

Many complex disease syndromes such as asthma consist of a large number of highly related, 
rather than independent, clinical phenotypes, raising a new technical challenge in identifying ge- 
netic variations associated simultaneously with correlated traits. In this study, we propose a new 
statistical framework called graph-guided fused lasso (GFlasso) to address this issue in a principled 
way. Our approach explicitly represents the dependency structure among the quantitative traits as 
a network, and leverages this trait network to encode structured regularizations in a multivariate 
regression model over the genotypes and traits, so that the genetic markers that jointly influence 
subgroups of highly correlated traits can be detected with high sensitivity and specificity. While 
most of the traditional methods examined each phenotype independently and combined the re- 
sults afterwards, our approach analyzes all of the traits jointly in a single statistical method, and 
borrow information across correlated phenotypes to discover the genetic markers that perturbe a 
subset of correlated triats jointly rather than a single trait. Using simulated datasets based on the 
HapMap consortium data and an asthma dataset, we compare the performance of our method with 
the single-marker analysis, and other sparse regression methods such as the ridge regression and 
the lasso that do not use any structural information in the traits. Our results show that there is a 
significant advantage in detecting the true causal SNPs when we incorporate the correlation pattern 
in traits using our proposed methods. 
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1 Introduction 

Recent advances in high-throughput genotyping technologies have significantly reduced the cost 
and time of genome- wide screening of individual genetic differences over millions of single nu- 
cleotide polymorphism (SNP) marker loci, shedding light to an era of "personalized genome" [[B 
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|2l|. Accompanying this trend, clinical and molecular phenotypes are being measured at phenome 
and transcriptome scale over a wide spectrum of diseases in various patient populations and lab- 
oratory models, creating an imminent need for appropriate methodology to identify omic-wide 
association between genetic markers and complex traits which are implicative of causal relation- 
ships between them. Many statistical approaches have been proposed to address various challenges 
in identifying genetic locus associated with the phenotype from a large set of markers, with the 
primary focus on problems involving a univariate trait [i3l|4l|5]|. However, in modem studies the 
patient cohorts are routinely surveyed with a large number of traits (from measures of hundreds of 
clinical phenotypes to genome-wide profiling of thousands of gene expressions), many of which 
are correlated among them. For example, in Figure [B the correlation structure of the 53 clinical 
traits in the asthma dataset collected as a part of the Severe Asthma Research Program (SARP) [[6| 
is represented as a network, with each trait as a node, the interaction between two traits as an edge, 
and the thickness of an edge representing the strength of correlation. Within this network, there ex- 
ists several subnetworks involving a subset of traits, and furthermore, the large subnetwork on the 
left-hand side of Figure [T] contains two subgroups of densely connected traits with thick edges. In 
order to understand how genetic variations in asthma patients affect various asthma-related clinical 
traits in the presence of such a complex correlation pattern among phenotypes, it is necessary to 
consider all of the traits jointly and take into account their correlation structure in the association 
analysis. Although numerous research efforts have been devoted to studying the interaction pat- 
terns among many quantitative traits represented as networks [|7l[8l|9l[T0l[TTl[T2l[T3l[T4l as well as 
discovering network submodules from such networks [[TTlfTSl . this type of network structure has 
not been exploited in association mapping [16lll7|. Many of the previous approaches examined 
one phenotype at a time to localize the SNP markers with a significant association and combined 
the results from a set of such single-phenotype association mapping across phenotypes. How- 
ever, we conjecture that one can detect additional weak associations and at the same time reduce 
false signals by combining the information across multiple phenotypes under a single statistical 
framework. 

In QTL mapping studies with pedigree data, a number of approaches have been proposed to 
detect pleiotropic effect of markers on multiple correlated traits by considering the traits jointly. 
However, these approaches involve only a weak and indirect form of structural information present 
in the phenotypes. The methods based on multivariate regression [|T8l [T9l f20ll with multiple out- 
comes were concerned with finding genetic loci that influence all of the phenotypes jointly, rather 
than explicitly taking into account the complex interaction patterns among the phenotypes. A dif- 
ferent approach has been proposed that first applies principle component analysis (PCA) to the 
phenotypes and uses the transformed phenotypes in a single-phenotype association test ETl l22ll . 
The transformation via PCA allows to extract the components that explain the majority of variation 
in phenotypes, but has a limitation in that it is not obvious how to interpret the derived phenotypes. 

More recently, in expression quantitative trait locus (eQTL) analysis with microarray gene 
expression measurements treated as quantitative traits, researchers have begun to combine an ex- 
plicit representation of correlation structure in phenotypes, such as gene networks, with genotype 
information to search for genetic causes of perturbations of a subset of highly correlated pheno- 
types [|23l l24l |25l |26ll . A module network fTTl, which is a statistical model developed for un- 
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Figure 1 : Illustration of association analysis using phenotype correlation graph for asthma dataset 

covering regulatory modules from gene expression data, was extended to incorporate genotypes 
of regulators, such that the expression of genes regulated by the same regulator was explained by 
the variation of both the expression level and the genotype of the regulators [23]. Although the 
model was able to identify previously unknown genetic perturbations in yeast regulatory network, 
the genotype information used in the model was limited to markers in regulators rather than the 
whole genome. Several other studies incorporated a gene co-regulation network in a genome- wide 
scan for association. In a network eQTL association study for mouse [i24ll . a gene co-regulation 
network was learned, a clustering algorithm was applied to this network to identify subgroups of 
genes whose members participate in the same molecular pathway or biological process, and then, 
a single-phenotype analysis was performed between genotypes and the phenotypes within each 
subgroup. If the majority of phenotypes in each subgroup were mapped to the common locus in 
the genome, that locus was declared to be significantly associated with the subgroup. Using this 
approach, new obesity-related genes in mouse were identified by examining the network module 
associated with the genetic locus previously associated with obesity-related traits such as body 
mass index and cholesterol level. A similar analysis was performed on yeast, where clusters of 
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Figure 2: Illustrations for multiple output regression with A: lasso, B: graph-constrained fused 
lasso, and C: graph-weighted fused lasso. 

yeast genes were mapped to a common eQTL hotspots f26\. One of the main disadvantages of this 
approach is that it first applies a clustering algorithm to identify subgroups of phenotypes in the 
network, rather than directly incorporating the network itself as a correlation structure, since the 
full network contains much richer information about complex interaction patterns than the clusters 
of phenotypes. Another disadvantage of this approach is that it relies on a set of single-phenotype 
statistical tests and combines the results afterwards in order to determine whether a marker is sig- 
nificantly affecting a subgroup of phenotypes, thus requiring a substantial effort in conducting 
appropriate multiple hypothesis testing. We believe that an approach that considers markers and 
all of the phenotypes jointly in a single statistical method has the potential to increase the power 
of detecting weak associations and reduce susceptibility to noise. 

In this article, we propose a family of methods, called the graph-guided fused lasso (GFlasso), 
that fully incorporates the quantitative-trait network as an explicit representation for correlation 
structure without applying additional clustering algorithms to phenotypes. Our methods combine 
multiple phenotypes in a single statistical framework, and analyze them jointly to identify SNPs 
perturbing a subset of tighly correlated phenotypes instead of combining results from multiple 
single-phenotype analyses. The proposed methods leverage a dependency graph defined on multi- 
ple quantitative traits such as the graph for the asthma-related traits shown in Figure [U assuming 
that such a graph structure is available from preprocessing steps or as prior knowledge from pre- 
vious studies. It is reasonable to assume that when a subset of phenotypes are highly correlated, 
the densely connected subgraphs over these correlated traits contain variables that are more likely 
to be synergistically influenced by the same or heavily overlapping subset(s) of SNPs with similar 
strength than an arbitrary subset of phenotypes. 

The proposed approach is based on a multivariate regression formalism with the Li penalty, 
commonly known as the lasso, that achieves "sparsistency" in the estimated model by setting many 
of the regression coefficients for irrelevant markers to zero [127112811. As a brief digression for clar- 
ity, sparsistancy refers to an asymptotic property in high-dimensional statistical inference that for 
the estimator of a p-dimensional vector 6 from n iid samples, where p can be ^ n, the probabil- 
ity of recovering the true non-zero elements S = {i : 7^ 0} in the estimator approach one 
in the limit, if the true non-zero elements are sparse in the sense that \S\ < n <^ p [|28ll . This 
property of the lasso makes it a natural approach for genome wide association analysis, where 
the marker genotypes are treated as the predictors, the phenotype in question is treated as the re- 
sponse, and the (sparse) set of markers having non-zero regression coefficients are interpreted as 
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the markers truely associated with the phenotype. However, when applied to association mapping 
with multivariate traits, the lasso is equivalent to a single-trait analysis that needs to be repeated 
over every single trait. In other words, for a collection of traits, each trait would be treated as 
independent of all other traits, and every trait would be regressed on a common set of marker 
genotypes via its own lasso (Figure [2j^), ignoring the possible coupling among traits. Our innova- 
tions in GFlasso that enable a departure from the baseline lasso for a single trait is that, in addition 
to the lasso penalty, we employ a "fusion penalty" that fuses regression coefficients across cor- 
related phenotypes, using either unweighed or weighted connectivity of the phenotype graph as 
a guide. This additional penalty will encourage sharing of common predictors (i.e., associated 
markers) to coupled responses (i.e., traits). The two fusion schemes lead to two variants of the 
GFlasso: a graph-constrained fused lasso (GcFlasso) based on only the graph topology (Figure 
|2^), and a graph-weighted fused lasso (GcFlasso) that offers a flexible range of stringency of the 
graph constraints through edge weights (Figure [2t). We developed an efficient algorithm based on 
quadratic programming combined with gradient search for estimating the regression coefficients 
under GFlasso. The results on two datasets, one simulated from HapMap SNP markers and the 
other collected from asthma patients, show that our method outperforms competing algorithms in 
identifying markers that are associated with a correlated subset of phenotypes. 

2 Material and Methods 

2.1 Lasso Regression for Multiple Independent Phenotypes 

Let X be an X J matrix of genotypes for individuals and J SNPs, where each element Xij 
of X is assigned 0, 1, or 2 according to the number of minor alleles at the j-th locus of the z-th 
individual. Let Y denote an x K matrix of K quantitative trait measurements over the same set 
of individuals. We use to denote the k-\h column of Y. A conventional single-trait association 
via linear regression model can be applied to this multiple-trait setting by fitting the model to X 
and each of the K traits y^'s separately: 

yfe = X/3, + efc, \/k = l,...,K, (1) 

where is a J- vector of regression coefficients for the A;-th trait that can be used in a statistical test 
to detect SNP markers with significant association, and is a vector of independent error terms 
with mean and a constant variance. We center each column of X and Y such that ^ . yik = and 
J2i ^ij = 0' ^iid consider the model in Equation ([T]) without an intercept. We obtain the estimates 
of B = {/3^, . . . , f3j^} by minimizing the resisual sum of squares: 

B = argmin5^(yfc-X/3fc)^-(yfc-X/3,). (2) 

k 

In a typical genome-wide association mapping, one examines a large number of marker loci with 
the goal of identifying the region associated with the phenotypes and markers in that region. A 
straight-forward application of the linear regression method in Equation (O to association mapping 
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with large J can cause several problems such as an unstable estimate of regression coefficients and 
a poor interpretability due to many irrelevant markers with non-zero regression coefficients. Sparse 
regression methods such as forward stepwise selection [29], ridge regression [30l'5l|, and lasso [27] 
that select a subset of markers with true association have been proposed to handle the situation with 
large J. Forward stepwise selection method iteratively selects one relevant marker at a time while 
trying to improve the model fit based on Equation ([2]), but it may not produce an optimal solution 
becuase of the greedy nature of the algorithm. Ridge regression has an advantage of performing the 
selection in a continuous space by penalizing the residual sum of square in Equation ^ with the L2 
norm of /S^^'s and shrinking the regression coefficients toward zero, but it does not set the regression 
coefficients of irrelevant markers to exactly zero. We use the lasso that penalizes the residual sum 
of square with the Li norm of regression coefficients and has the property of setting regression 
coefficients with weak association markers exactly to zero, thus offering the advantages of both 
forward stepwise selection and ridge regression. The lasso estimate of the regression coefficients 
can be obtained by solving the following: 

Biasso ^ argmin5^(yfc-X/3,f ■(yfc-X^,) + A5^|/?fc,| (3) 

k k,j 

where A is a regularization parameter that controls the amount of sparsity in the estimated re- 
gression coefficients. Setting A to a large value increases the amount of penalization, setting more 
regression coefficients to zero. Many fast algorithms are available for solving Equation (|3]) [|27ll3TI . 

The lasso for multiple-trait association mapping in Equation ^ is equivalent to solving a set 
of K independent regressions for each trait with its own Li penalty, and does not provide a mech- 
anism to combine information across multiple traits such that the estimates reflect the potential 
relatedness in the regression coefficients for those correlated traits that are influenced by common 
SNPs. However, several traits are often highly correlated such as in gene expression of co-regulated 
genes in eQTL study, and there might be genotype markers that are jointly associated with those 
correlated traits. Below, we extend the standard lasso and propose a new penalized regression 
method for detecting markers with pleiotropic effect on correlated quantitative traits. 

2.2 Graph-Guided Fused Lasso for Multiple Correlated Phenotypes 

In order to identify markers that are predictive of multiple phenotypes jointly, we represent the 
correlation structure over the set of K traits as an edge- weighted graph, and use this graph to guide 
the estimation process of the regression coefficients within the lasso framework. We assume that 
we have available from a pre-processing step a phenotype correlation graph G consisting of a set 
of nodes V, each representing one of the K traits, and a set of edges E. In this article, we adopt 
a simple and commonly-used approach for learning such graphs, where we first compute pairwise 
Pearson correlation coefficients for all pairs of phenotypes using yfc's, and then connect two nodes 
with an edge if their correlation coefficient is above the given threshold p. We set the weight 
of each edge (m, /) E E to the absolute value of correlation coefficient Ir^,;], so that the edge 
weight represents the strength of correlation between the two nodes. This thresholded correlation 
graph is also known as a relevance network, and has been widely used as a representation of gene 



6 



interaction networks [|32l[33l . It is worth pointing out that the choice of methods for obtaining the 
phenotype network is not a central issue of our method. Other variations of the standard relevance 
network have been suggested [34], and any of these graphs can also be used within our proposed 
regression methods. Below, we first introduce GcFlasso that makes use of unweighted graph, 
and further extend this method to GcFlasso to take into account the full information in the graph 
including edge weights. 

Given the correlation graph of phenotypes, it is reasonable to assume that if two traits are 
highly correlated and connected with an edge in the graph, their variation across individuals might 
be explained by genetic variations at the same loci, possibly having the same amount of influence 
on each trait. In GcFlasso, this assumption is expressed as an additional penalty term that fuses 
two regression coefficients Pjm and (3ji for each marker j if traits m and / are connected with an 
edge in the graph, as follows: 

B«c = argmin J](y, - Xf3,f ■ (y^ - X(3,) 

k 

k j {m,l)eE j 

where A and 7 are regularization parameters that determine the amount of penalization. The last 
term in Equation (HI) is called a fusion penalty [|35ll . and encourages Pjm and sign(r m, i) Pji to take 
the same value by shrinking the difference between them toward zero. A larger value for 7 leads to 
a greater fusion effect, or greater sparsity in \Pjm — sign(rm,/)/3j7rs. We assume that if two traits m 
and / connected with an edge in G are negatively correlated with r„a < 0, the effect of a common 
marker on those traits takes an opposite direction, and we fuse Pj„i and {—[3ji), or equivalently, 
l3jm and sign(r,„,i)/?j«- When the fusion penalty is combined with the lasso penalty as in Equation 
(HI), the lasso penalty sets many of the regression coefficients to zero, and for the remaining non- 
zero regression coefficients, the fusion penalty flattens the values across multiple highly correlated 
phenotypes for each marker so that the strength of influence of each marker becomes similar across 
those correlated traits. The idea of fusion penalty has been first used in the classical regression 
problem over univariate response (i.e., single-output) from high-dimensional covariates to fuse the 
regression coefficients of two adjacent covariates when the covariates are assumed to be ordered 
such as in time [35]. This corresponds to coupling pairs of elements in the adjacent rows of the 
same column in the coefficient matrix B in Equation dH). In GcFlasso, we employ a similar strategy 
in a multiple-output regression in order to identify pleiotropic effect of markers, and let the trait 
correlation graph determine which pairs of regression coefficients should be fused. Now, every 
such coupled coefficient pair corresponds to the elements of the corresponding two columns in the 
same row of matrix B in Equation (H)). It is possible to show the asymptotic properties of estimators 
of the GFlasso methods as iV — > 00 analogous to the ones previously shown for the lasso and the 
fused lasso [|35l[36]|. 

In a multiple-trait association mapping, networks of clinical traits or molecular traits (i.e., gene 
expressions) typically contain many subnetworks within which nodes are densely connected, and 
we are interested in finding the genetic variants that perturb the entire set of traits in each subnet- 
work. This can potentially increase the power of detecting weak associations between genotype 
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and phenotype that may be missed when each phenotype is considered independently. When used 
in this setting, the GcFlasso looks for associations between a genetic marker and a subgraph of 
phenotype network rather than a single phenotype. Unlike other previous approaches for detecting 
pleiotropic effect that first apply clustering algorithms to learn subgroups of traits and then search 
for genetic variations that perturb the subgroup, GcFlasso uses the full information on correla- 
tion structure in phenotypes available as a graph, where the subgroup information is embedded 
implicitly within the graph as densely connected subgraphs. Although the fusion penalty in the 
GcFlasso is applied locally to a pair of regression coefficients for neighboring trait pairs in the 
graph, this fusion effect propagates to the regression coefficients for other traits that are connected 
to them in the graph. For densely connected nodes, the fusion is effectively applied to all of the 
members of the subgroup, and the set of non-zero regression coefficients tend to show a block 
structure with the same values across the correlated traits given a genotype marker with pleiotropic 
effect on those traits, as we demonstrate in experiments. If the edge connections are sparse within 
a group of nodes, the corresponding traits are only weakly related, and there is little propagation 
of fusion effect through edges in the subgroup. Thus, the GcFlasso incorporates the subgrouping 
information through the trait correlation graph in a more flexible manner compared to previous 
approaches. 

Now, we present a further generalization of GcFlasso that exploit the full information in the 
phenotype networks for association mapping. Note that the only structural information used in 
the GcFlasso is the presence or absence of edges between two phenotypes in the graph. The 
GwFlasso is a natural extension of the GcFlasso that takes into account the edge weights in 
graph G in addition to the graph topology. The G„Flasso weights each term in the fusion penalty 
in Equation dH) by the amount of correlation between the two phenotypes being fused, so that the 
amount of correlation controls the amount of fusion. More generally, G„Flasso weights each term 
in the fusion constraint in Equation dH) with a monotonicaly increasing function of the absolute 
values of correlations, and finds an estimate of the regression coefficients as follows: 



If the two phenotypes m and / are highly correlated in graph G with a relatively large edge weight, 
the fusing effect increases between these two phenotypes since the difference between the two cor- 
responding regression coefficients (3jm and Pji is penalized more than for other pairs of phenotypes 
with weaker correlation. In this article, we consider /i(r) = |r| for GcFlasso and f2ir) = for 
GcFlasso. We note that the GcFlasso is a special case of the GwFlasso with /(r) = 1. 

The optimization problems in Equations dH) and © can be formulated as a quadratic pro- 
gramming as described in Appendix, and there are many publicly available software packages that 
efficiently solve such quadratic programming problems. The regularization parameters A and 7 
can be determined by a cross-validation or a validation set, although for a large problem, a grid 
search for the A and 7 can be time-consuming. In order to improve the efficiency in determining 
the A and 7, we use a gradient-descent type of algorithm as described in Appendix. 



= argmin ^^(y, - X(3,f ■ (y^ - X/3,) 



k 




(5) 
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2.3 Simulation Scheme for Model Validation 



We simulated genotype data for 250 individuals based on the HapMap data in the region of 8.79- 
9.20M in chromosome 7. The first 60 individuals of the genotype data came from the parents 
of the HapMap CEU panel. We generated genotypes for additional 190 individuals by randomly 
mating the original 60 individuals on the CEU panel. We included only those SNPs with minor 
allele frequency greater than 0. 1 . Since our primary goal is to measure the performance of the 
association methods in the case of multiple correlated phenotypes, we sampled 50 SNPs randomly 
from the 697 SNPs in the region in order to reduce the correlation among SNPs from the linkage 
disequilibrium. 

Given the simulated genotype, we generated the true associations represented as regression 
coefficients B and phenotype data as follows. We assumed that the number of phenotypes is 10, 
and that there are three groups of correlated phenotypes of size 3, 3, and 4, respectively, so that the 
phenotypes in each group form a subnetwork in the correlation graph of phenotypes. We randomly 
selected three SNPs as affecting all of the phenotypes in the first subnetwork, and four SNPs as 
influencing each of the remaining two subnetwork. We assumed that there is one additional SNP 
affecting phenotypes in the first two subnetworks, which corresponds to the case of a SNP perturb- 
ing a super-network consisting of two subnetworks such as the large subnetwork on the left-hand 
side of Figure [B In addition, we assumed one additional SNP affecting all of the phenotypes. We 
set the effect size of all of the true association SNPs to the same value. Once we set the regres- 
sion coefficients according to this setup, we generated the phenotype data with noise distributed as 
A^(0, 1), using the simulated genotypes as covariates. 

2.4 Asthma Dataset 

We apply our methods to data collected from 543 asthma patients as a part of the Severe Asthma 
Research Program (SARP). The genotype data were obtained for 34 SNPs within or near IL-4R 
gene that spans a 40kb region on chromosome 16. This gene has been previously shown to be 
implicated in severe asthma [37] . We used the publicly available software PHASE [|38l to impute 
missing alleles and phase the genotypes. The phenotype data included 53 clinical traits related to 
severe asthma such as age of onset, family history, and severity of various symptoms. The phe- 
notype correlation graph thresholded at 0.7 as shown in Figure [T] reveals several subnetworks of 
correlated traits. For example, the subset of traits related to lung physiology (the nodes for base- 
lineFEVl, PreFEFPred, PostbroPred, PredrugFEVlP, Max FEVIP, etc.) form a large subnetwork 
on the left-hand side of Figure [H whereas traits representing quality of life of the patients (the 
nodes for AQLQ Environment, AQLQSymptom, AQLQ Emotion, and AQLQ Activity) are found 
in a small subnetwork near the center of Figure [TJ Our goal is to examine whether any of the SNPs 
in the IL-4R gene are associated with a subnetwork of correlated traits rather than an individual 
trait. We standardized measurements for each phenotype to have mean and standard deviation 1 
so that their values are roughly in the same range across phenotypes. 
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Figure 3: ROC curves for comparison of association analysis methods with different sample size 
A^. A: N=50, B: iV=100, C: iV=150, D: N=200, and E: A^=250. The effect size is 0.5, and 
the threshold p for the phenotype network is set to 0.3. Note that the curves for the GcFlasso, 
GcFlasso, and GcFlasso almost entirely overlap. 



3 Results 

We compare the results from our proposed methods, GcFlasso , GcFlasso and GcFlasso, with the 
ones from the single-marker analysis as well as multivariate regression methods such as the ridge 
regression and the lasso that do not use any structural information in the phenotypes. In the ridge 
regression, we set the regularization parameter to 0.0001, which is equivalent to adding a small 
value of 0.0001 to the diagonal of X'X to make the standard regression problem non-singular. For 
the lasso and our proposed methods, we used the gradient-descent search for the regularization pa- 
rameters A and 7 as described in Appendix. We used (-log(p- value)) for the standard single-marker 
analysis, and the absolute value of regression coefficients \f3jk\'s for the multivariate regression 
methods and our proposed methods, as a measure of the strength of association. 



3.1 Simulation Study 

We evaluate the performance of the association methods based on two criteria, sensitivity /specificity 
and phenotype prediction error. The sensitivity and specificity measure whether the given method 
can successfully detect the true association SNPs with few false positives. The (1 -specificity) and 
sensitivity are equivalent to type I error rate and (1-type II error rate), and their plot is widely known 
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Figure 4: ROC curves for comparison of association analysis methods with varying effect size. 
Effect size is A: 0.3, B: 0.5, C: 0.8, and D: 1.0. The sample size is 100, and the threshold p for the 
phenotype correlation graph is 0.1. 
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Figure 5: ROC curves for comparison of association analysis methods with different values of 
threshold (p) for the phenotype correlation network. A: p=0.1, B: p=0.3, C: p=0.5, and D: p=0.7. 
The sample size is 100, and the effect size is 0.8. 



as a receiver operating characteristic (ROC) curve. The phenotype prediction error represents how 
accurately we can predict the values of phenotypes given the genotypes of new individuals, using 
the regression coefficients estimated from the previously available genotype and phenotype data. 
We generate additional dataset of 50 individuals, y"*^* and X"™, and compute the phenotype pre- 
diction error as sum of squared differences between the true values y"™ and predicted values y"™ 
of the phenotypes, ^^iyT^ - Y"^^)' ' (yT^ - y"'"')' where y^""^ = X"'="'^^. For both criteria for 
measuring performance, we show results averaged over 50 randomly generated datasets. 

In the results shown below, for each dataset of size N, we fit the lasso and the graph-guided 
methods using (A^ — 30) samples, and use the remaining 30 samples as a validation set for deter- 
mining the regularization parameters. Once we determine the regularization parameters, we use the 
entire dataset of size to estimate the final regression coefficients given the selected regularization 
parameters. 

We apply the various association methods to datasets with varying sample sizes, and show the 
ROC curves in Figure [3l We used the threshold p=0.3 to obtain the phenotype correlation graph, 
and set the effect size to 0.5. The results confirm that the lasso is an effective method for detecting 
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Figure 7: Results of association analysis by different methods based on a single simulated dataset. 
Effect size 0.8 and threshold p = 0.3 for the phenotype correlation graph are used. Bright pixels 
indicate large values. A: The correlation coefficient matrix of phenotypes, B: the edges of the 
phenotype correlation graph obtained at threshold 0.3 are shown as white pixels, C: The true re- 
gression coefficients used in simulation. Rows correspond to SNPs and columns to phenotypes. 
D: -log(p- value). Absolute values of the estimated regression coefficients are shown for E: ridge 
regression, F: lasso, G: GcFlasso, H: G^Flasso, and I: G^Flasso. 



true causal SNPs and is affected less by the irrelevant SNPs, compared to the single-marker analysis 
and ridge regression. When we use the weighted fusion penalty in addition to the lasso penalty 
as in GcFlasso, GcFlasso, and GcFlasso, the performance significantly improves over the lasso 
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Figure 8: Comparison of the computation time for lasso, GcFlasso, G^Flasso, and G^Flasso. A: 
Varying the number of SNPs with the number of phenotypes fixed at 10. The phenotype correlation 
graph at threshold p = 0.3 with 31 edges is used. B: Varying the number of phenotypes with the 
number of SNPs fixed at 50. The phenotype networks are obtained using threshold p = 0.3. 
The number of edges in each phenotype network is 11, 34, 53, 88, and 142 for the number of 
phenotypes 10, 20, 30, 40, and 50, respectively. 



across all of the samples sizes shown in Figure [3l 

In order to see how the effect size affects the performance of the methods for association anal- 
ysis, we vary the effect size and show the ROC curves in Figure |4l for the threshold p=0. 1 of the 
phenotype correlation network and sample size = 100. The GcFlasso and GcFlasso outper- 
form all of the other methods across all of the effect sizes. Because of the relatively low value 
of the threshold p = 0.1, the correlation phenotype contains many edges between a pair of phe- 
notypes that are only weakly correlated. Thus, the GcFlasso that does not distinguish edges for 
strong correlation from those for weak correlation does not show a consistent performance across 
different effect size, performing better than the lasso at effect size 0.3 but worse than the lasso at 
effect size 1.0. The GcFlasso and GcFlasso have the flexibility to handle different strengths of 
correlation in the graph, and consistently outperforms GcFlasso as well as the methods that do not 
consider the structural information in the phenotypes. 

In order to examine the effect of the threshold p for the phenotype correlation graph on the 
performance of our methods, we evaluate the GFlasso methods with p at 0.1, 0.3, 0.5, and 0.7, 
and show the ROC curves in Figure |5l We include the ROC curves for the single-marker analysis, 
the ridge regression, and the lasso that do not use the thresholded phenotype correlation graph in 
each panel of Figure [5] repeatedly for the ease of comparison. We use the sample size = 100 
and the effect size 0.8. Regardless of the threshold p, the GcFlasso and GcFlasso outperform 
all of the other methods or perform at least as well as the lasso. As we have seen in Figure 
in the GcFlasso does not have the flexibility of accommodating edges of varying correlation 
strength in the phenotype correlation graph, and this negatively affects the performance of the 
GcFlasso at the low threshold p = 0.1 in Figure |4]\. As we increase the threshold p in Figures 
11^ and C, the phenotype correlation graph include only those edges with significant correlations. 
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Thus, the performance of GcFlasso approaches that of G^Flasso and G^Flasso, and the curves 
of the three methods in the GFlasso family almost entirely overlap. When the treshold is relatively 
high at p = 0.7, the number of edges in the graph is close to 0, effectively removing the fusion 
penalty. As a result, the performance of the graph-guided methods becomes close to the lasso. 
Overall, taking into account the correlation structure in phenotypes improves the detection rate of 
true causal SNPs. Once the phenotype correlation graph includes the edges that capture strong 
correlations, including more edges by further lowering the threshold p does not significantly affect 
the performance of GcFlasso and GcFlasso. The same tendency is shown in the prediction errors 
in Figure [6l 

We show an example of a simulated dataset and the estimated association strength in Figure |71 
using the sample size = 100 and effect size 0.8. Although the lasso is more successful in setting 
the regression coefficients of irrelevant SNPs to zero than the ridge regression, it still finds many 
SNPs as having a non-zero association strength. The GcFlasso, GcFlasso, and GcFlasso remove 
most of those spurious SNPs, and shows a clear block- structure in the estimated regression coeffi- 
cients, with each causal SNP spanning subgroups of of correlated phenotypes. Since GcFlasso uses 
only the information on the presence or absence of edges, when edges of weak correlation con- 
nect nodes across two true subgraphs, the GcFlasso is unable to ignore the weak edges, and fuses 
the effect of SNPs on the phenotypes across those two subgraphs. This undesirable property of 
GcFlasso disappears when we incorporate the edge weights in GcFlasso and GcFlasso. 

We show the computation time for solving a single optimization problem for the lasso, GcFlasso, 
and G„Flasso in Figure [8] for varying number of SNPs and phenotypes. 

3.2 Case Study Using Asthma Dataset 

Figure |9lA. shows the correlation matrix of the phenotypes after reordering the phenotypes using 
the agglomerative hierarchical clustering algorithm so that highly correlated phenotypes are clus- 
tered with a block structure along the diagonal. Using threshold p = 0.7, we obtain a phenotype 
correlation graph as shown in Figure[9^, where the white pixel at position {i, j) indicates that the 
z-th and j-th phenotypes are connected with an edge in the graph. The graph shows several blocks 
of white pixels representing densely connected subgraphs. We show the full graph in Figure [IJ We 
present results for the single-marker regression analysis, the ridge regression, the lasso, GcFlasso, 
GcFlasso, and GcFlasso in Figure |9]C!-H, respectively, where the rows represent phenotypes, 
and the columns correspond to genotypes, with bright pixels indicating high strength of associa- 
tion. The phenotypes in rows are rearranged according to the ordering given by the agglomerative 
hierarchical clustering so that each row in Figures |9lC-H is aligned with the phenotypes in the cor- 
relation matrix in Figure |9l\- In the fusion penalty in our proposed methods, we use the edges in 
Figure obtained at threshold p = 0.7. The graph obtained at threshold p = 0.7 seems to capture 
the previously known dependencies among the clinical traits such as subnetworks corresponding to 
lung physiology and quality of life. We select the regularization parameters in the lasso, GcFlasso, 
G,iFlasso, and GiFlasso using a five-fold cross validation. 

As shown in Figures |9]C and E, both the single-marker regression analysis and the lasso find a 
SNP near the top row, known as Q551R, as significantly associated with a block of correlated phe- 
notypes. This subset of traits corresponds to the bottom subnetwork (consisting of baselineFEVl, 
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Figure 9: Results for the association analysis of the asthma dataset. A: Phenotype correlation 
matrix. B: Phenotype correlation matrix thresholded at p = 0.7. C: -log(p-value) from single- 
marker statistical tests using a single-phenotype analysis. Estimated /3^,'s for D: ridge regression, 
E: lasso, F: GcFlasso, G: G^Flasso, and H: G^Flasso. 

PreFEFPred, AvgNO, BMI, PostbroPred, BaseFEVPer, PredrugFEVlP, MaxFEVlP, FEVlDiff, 
and PostFEF) that resides within the large subnetwork on the left-hand side of Figure [H and rep- 
resents traits related to lung physiology. This Q551R SNP has been previously found associated 
with severe asthma and its traits for lung physiology [1371 . and our results confirm this previous 
finding. In addition, the results from the single-marker analysis in Figure show that on the 
downstream of this SNP, there is a set of adjacent SNPs that appears to be in linkage disequilib- 
rium with this SNP and at the same time has generally a high level of association with the same 
subset of phenotypes. On the other hand, the lasso in Figure sets most of the regression coef- 
ficients for this block of SNPs in linkage disequilibrium with Q551R to zero, identifying a single 
SNP as significant. This confirms that the lasso is an effective method for finding sparse estimates 
of the regression coefficients, ignoring most of the irrelevant markers by setting corresponding re- 
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Table 1: Summary of results for the association analysis of the asthma dataset 
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11 




125 


123 
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gression coefficients to zero. The ridge regression as shown in Figure HJD does not have the same 
property of encouraging sparsity as the lasso. In fact, in statistical literature, it is well-known that 
the ridge regression performs poorly in problems that require a selection of a small number of 
markers affecting phenotypes. 

Since our methods in the GFlasso family include the lasso penalty, the results from GcFlasso, 
GcFlasso, and GcFlasso show the same property of sparsity as the lasso in their estimates, as 
can be seen in Figures (Qf-H. In addition, because of the fusion penalty, the regression coefficients 
estimated by our methods form a block structure, where the regression coefficients for each SNP 
are set to the same value within each block. Thus, each horizontal bar indicates a SNP influencing 
a correlated block of phenotypes. It is clear that the horizontal bars in Figures llf-H are generally 
aligned with the blocks of highly correlated phenotypes in Figure [9K- This block structure is much 
weaker in the results from the lasso in Figure |9^. For example. Figures |9p-H show that the SNPs 
rs3024660 and rs3024622 on the downstream of Q551R are associated with the same block of 
traits as Q551R, generating an interesting new hypothesis that these two SNPs as well as Q551R 
might be jointly associated with the same subset of clinical traits. This block structure shared by 
the two SNPs is not obvious in the results of single-marker tests and the lasso. 

We fit the lasso and our methods in the GFlasso family, while varying the threshold for the 
correlation graph, and summarize the results in Table[TJ When the threshold is high at p = 0.9, only 
a very small number of edges are included in the phenotype correlation graph, and the contribution 
of the graph-guided fusion penalty in GFlasso is low. Thus, the number of non-zero regression 
coefficients found by the GcFlasso, GcFlasso, and GcFlasso is similar to the result of the lasso 
that does not have the fusion penalty. When we lower the threshold to p = 0.7, the number of 
non-zero regression coefficients decreases significantly for our methods. As can be seen in Figure 
[9^, most of the significant correlation structure is captured in the thresholded correlation graph at 
p = 0.7. Thus, as we further lower the threshold, the number of non-zero regression coefficients 
generally remains unchanged. 
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4 Discussion 



When multiple phenotypes are involved in association mapping, it is important to combine the 
information across phenotypes and make use of the full information available in data in order to 
achieve the maximum power. Most of the previous approaches either considered each phenotype 
separately, or used a two-stage method that first extracts relatively primitive types of phenotype 
correlation structure such as phenotypes transformed through PCA or subgroups of phenotypes 
found by clustering algorithms, and then performs a single-phenotype analysis in the second stage. 
Networks or graphs have been extensively studied as a representation of correlation structure of 
phenotypes such as gene expression or clinical traits because they provide a flexible and explicit 
form of representation for capturing dependencies. Graphs contain rich information on phenotype 
interaction patterns such as densely connected subgraphs that can be interpreted as a cluster of phe- 
notypes participating in the same biological process. Applying a clustering algorithm to identify 
subgraphs as in the two-stage methods can potentially result in a loss of information, decreasing 
the power of the study. Developing a tool for multiple-phenotype association mapping that can 
directly leverage this full graph structure of phenotypes can offer a way to combine the large body 
of previous research in network analysis with the work on association mapping. 

In this article, we proposed a new family of regression methods called GFlasso that directly 
incorporates the correlation structure represented as a graph and uses this information to guide 
the estimation process. The methods consider all of the phenotypes jointly and estimates the 
model in a single statistical framework instead of using a two-stage algorithm. Often, we are 
interested in detecting genetic variations that perturb a sub-module of phenotypes rather than a 
single phenotype, and the GFlasso achieves this through fusion penalty in addition to the lasso 
penalty that encourages parsimony in the estimated model. The fusion penalty locally fuses two 
regression coefficients for a pair of correlated phenotypes, and this effect propagates through edges 
of graphs, effectively applying fusion to all of the nodes within each subgraph. The GcFlasso 
used an unweighted graph structure as a guide to find a subset of relevant covariates that jointly 
affect highly correlated outputs. The G„Flasso used additional information of edge weights to 
further add flexibility. Using simulated and asthma datasets, we demonstrated that including richer 
information on phenotype structure as in the GcFlasso and GcFlasso improves the accuracy in 
detecting true associations. 

We used a simple scheme of a thresholded correlation graph for learning the correlation struc- 
ture for phenotypes to be used in the GFlasso. Many different types of network-learning algortihms 
have been developed previously. For example, graphical Gaussian models (GGMs) are constructed 
based on partial correlations that capture the direct influence of interacting nodes, and have been 
commonly used for inferring gene networks from microarray data. Furthermore, in order to handle 
the case with a large number of nodes and a relatively small sample size, sparse GGMs have been 
developed. It would be interesting to see if using more sophisticated graph learning algorithms can 
improve the performance of the GFlasso. 

In this study, we assumed that the graph structure is available from pre-processing step. One 
of the possible extensions of the proposed method is to learn the graph structure and the regression 
coefficients jointly by combining the GFlasso with the graphical lasso [|39ll that learns sparse co- 
variance matrix for phenotypes. In Geronemo, both the module network structure and the markers 
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of regulators regulating modules were learned simultaneously, although the genotype information 
was limited to the markers within regulators [23 1. When the GFlasso is extended to learn the graph 
structure as well as regression coefficients, the method can be applied to a full genome- wide scan 
for associations. 

The GFlasso considered only dependencies among phenotypes, and did not assume any de- 
pendencies among the markers. Since recombination breaks chromosomes during meiosis at non- 
random sites, segments of chromosomes are inherited as a unit from ancestors to descendants rather 
than an individual nucleotide, creating a relatively low diversity in observed haplotypes than would 
be expected if each allele were inherited independently. Thus, SNPs with high linkage disequilib- 
rium are likely to contribute jointly to a phenotype in a regression-based penetrance function. By 
incorporating the structural information in both the genome and phenome, we expect to be able to 
identify a block of correlated markers influencing a set of correlated phenotypes. 

Appendix: Parameter Estimation 

In this section, we describe the procedure for obtaining estimates of the regression coefficients in 
GwFlasso. Since the GcFlasso is a special case of GwFlasso with /(r) = 1, the same procedure 
can be applied to GcFlasso in a straight-forward manner. The optimization problem in Equation 
([5]) is the Lagrangian form of the following optimization problem: 

GcFlasso : = argmin ^(y, - X(3,f ■ (y^ - (6) 

k 

S. t. XI l-^J'^l - firml) X \Pjm - Sign{rml)Pjl\ < 82- 

k j {m,l)£E j 

where si and S2 are tuning parameters corresponding to A and 7 in Equation ([5]). 

Since the objective function and constraints in Equation ^ are convex, we can formulate this 
problem as a quadratic programming (QP) as follows. Let f3^ denote a {J ■ K)-vector that can 
be obtained by concatenating /3fc's such that (3^ = {f3j , . . . ,/3]^)^. We represent Pjk = (3jf, — 
f3~j^, where > and (3j^. > 0, and let /3+ and denote ( J- Er)-vectors of /5^'s and (3~,^'s 
respectively. We define Oj^(^m,i) = l3j,m — sign(rmi)/3j^; for all (m, /) G E and j = 1, . . . , J, 
and let e,,^m,i) = ^,t(™,/) -'(^Um,i) ^+^,,) > and Oj^^^^,^ > 0. Let 6, = (Ol . . . , ^f^,)^, 
where 0e = (6*1,61 • • • ) (^J,e)^ for e = (m, I) G E. We define 6^ and 0^ similarly. Let M be a 
( J ■ \E\) X ( J ■ K) matrix, or equivalently a \E\ x K matrix of J x J sub-matrices. Each sub- 
matrix Be,fc of M for e = 1, . . . , \E\ and k = 1, . . . , K is an identity matrix if e = (m, /) and 
k = m. If e = {m,l) and A; = /, Be ^ is set to a diagonal matrix with —1 along the diagonal. 
Otherwise, Bg ^ is set to a matrix of O's. Let i? be a ( J ■ |i?|)-vector of \E\ sub-vectors with length 
J. Each sub-vector in R is set to f{rm,i) ■ Ij, where 1 j represents a J-vector of I's. Then, the QP 
problem for Equation Q can be written as 

min 5^(yfe - X(3,f ■ (y, - X/3,) (7) 

k 
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Figure 10: An example of the cross validation error surface over a grid of (si, S2) from graph- 
weighted fused lasso. 
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where / is a ( J ■ x (J ■ K) identity matrix, and 1(j-k) is a ( J ■ K)-vector of I's. 

The above QP procedure finds the optimal f3^'s for fixed si and S2. The si and S2 can be 
selected via cross-validation by running the QP procedure on a grid of Si and S2 and selecting the 
Si and S2 that give the lowest validation- set error C(si, S2) as was suggested for the fused lasso 
[|35l . The QP solver for G„FLasso runs reasonably fast for fixed Si and S2, but the grid search 
with a cross-validation can be time-consuming. Instead, we take a gradient-descent approach that 
iterates between solving the QP with the current values of si and S2 and updating si and S2 with 
(si, S2) <— (si, S2) — rjVC{si, S2), where the gradient is approximated by a finite difference vector 
c{s^,s2+h)-cis^,s2) -^_ Figure [10] shows a typical example of the cross-validation 
error over the grid of (si, S2) from G^Flasso. We exploit the shape of this error surface, and 
determine the initial values sf*^ and s^2^ for the gradient descent as follows. We first search for 
sf that produces the minimum cross-validation error by solving the lasso with S2 = 00. Then we 
fix si at sf \ and perform another one-dimensional search in the direction of S2, starting from 
to find the optimal sf^ for the G^Flasso along this path. In our experiments, we found that the 
initial values obtained by this procedure was sufficiently close to the global optimum, and that it 
converged to the optimum within a relatively small number of iterations. 
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